Kyzylorda Region
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Denmark > North Jutland > Aalborg (0.04)
- North America > United States > Florida > Brevard County > Cape Canaveral (0.04)
- (3 more...)
Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh
Laiyk, Nurkhan, Orel, Daniil, Joshi, Rituraj, Goloburda, Maiya, Wang, Yuxia, Nakov, Preslav, Koto, Fajri
Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs' understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entity of our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for low-resource languages.
- North America > United States (0.14)
- Asia > Russia (0.14)
- Asia > Kazakhstan > Akmola Region > Astana (0.04)
- (18 more...)
- Research Report (1.00)
- Personal (1.00)
- Law (1.00)
- Health & Medicine (1.00)
- Banking & Finance (0.93)
- (6 more...)
KazQAD: Kazakh Open-Domain Question Answering Dataset
Yeshpanov, Rustem, Efimov, Pavel, Boytsov, Leonid, Shalkarbayuli, Ardak, Braslavski, Pavel
We introduce KazQAD -- a Kazakh open-domain question answering (ODQA) dataset -- that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements. We use a combination of machine translation, Wikipedia search, and in-house manual annotation to ensure annotation efficiency and data quality. The questions come from two sources: translated items from the Natural Questions (NQ) dataset (only for training) and the original Kazakh Unified National Testing (UNT) exam (for development and testing). The accompanying text corpus contains more than 800,000 passages from the Kazakh Wikipedia. As a supplementary dataset, we release around 61,000 question-passage-answer triples from the NQ dataset that have been machine-translated into Kazakh. We develop baseline retrievers and readers that achieve reasonable scores in retrieval (NDCG@10 = 0.389, MRR = 0.382), reading comprehension (EM = 38.5, F1 = 54.2), and full ODQA (EM = 17.8, F1 = 28.7) settings. Nevertheless, these results are substantially lower than state-of-the-art results for English QA collections, and we think that there is still ample room for improvement. We also show that OpenAI's current ChatGPTv3.5 is not able to answer KazQAD test questions in the closed-book setting with acceptable quality. The dataset is freely available under the Creative Commons licence (CC BY-SA) at https://github.com/IS2AI/KazQAD.
- Asia > Russia (0.14)
- North America > United States (0.14)
- Asia > Kazakhstan > Akmola Region > Astana (0.04)
- (20 more...)
- Research Report (0.64)
- Overview (0.46)
- Education (1.00)
- Information Technology (0.88)
- Leisure & Entertainment > Sports (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.86)
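The KazQAD abstract above reports reading-comprehension and ODQA results as exact match (EM) and F1. As a rough illustration of how such scores are commonly computed per question, here is a minimal sketch. It assumes a simplified normalization (lowercasing and whitespace tokenization) and a single gold answer; the abstract does not state the normalization the authors use, and the function names are hypothetical.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed, simplified normalization)."""
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and one gold answer."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores would then be the mean of these per-question values, usually scaled to percentages.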
A Probabilistic Programming Approach To Probabilistic Data Analysis
Probabilistic techniques are central to data analysis, but different approaches can be challenging to apply, combine, and compare. This paper introduces composable generative population models (CGPMs), a computational abstraction that extends directed graphical models and can be used to describe and compose a broad class of probabilistic data analysis techniques. Examples include discriminative machine learning, hierarchical Bayesian models, multivariate kernel methods, clustering algorithms, and arbitrary probabilistic programs. We demonstrate the integration of CGPMs into BayesDB, a probabilistic programming platform that can express data analysis tasks using a modeling definition language and structured query language. The practical value is illustrated in two ways. First, the paper describes an analysis on a database of Earth satellites, which identifies records that probably violate Kepler's Third Law by composing causal probabilistic programs with nonparametric Bayes in 50 lines of probabilistic code. Second, it reports the lines of code and accuracy of CGPMs compared with baseline solutions from standard machine learning libraries.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Denmark > North Jutland > Aalborg (0.04)
- North America > United States > Florida > Brevard County > Cape Canaveral (0.04)
- (3 more...)
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.66)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)
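The satellite analysis mentioned in the CGPM abstract above looks for records that probably violate Kepler's Third Law. The sketch below is not the paper's BayesDB or CGPM code. It is a plain deterministic check under stated assumptions: each record is a dictionary with hypothetical keys `name`, `apogee_km`, `perigee_km` (altitudes above the Earth's surface) and `period_minutes`, and a record is flagged when its reported period differs from the Keplerian period by more than an arbitrary relative tolerance.

```python
import math

# Standard gravitational parameter of Earth (km^3 / s^2) and equatorial radius (km).
MU_EARTH_KM3_S2 = 398600.4418
EARTH_RADIUS_KM = 6378.137


def kepler_period_minutes(apogee_km: float, perigee_km: float) -> float:
    """Orbital period implied by Kepler's Third Law: T = 2*pi*sqrt(a^3 / mu)."""
    # Semi-major axis is the Earth's radius plus the mean of the two altitudes.
    semi_major_axis = EARTH_RADIUS_KM + (apogee_km + perigee_km) / 2.0
    period_seconds = 2.0 * math.pi * math.sqrt(semi_major_axis**3 / MU_EARTH_KM3_S2)
    return period_seconds / 60.0


def flag_kepler_violations(records: list[dict], rel_tol: float = 0.1) -> list[str]:
    """Return names of records whose reported period deviates beyond rel_tol."""
    flagged = []
    for rec in records:
        expected = kepler_period_minutes(rec["apogee_km"], rec["perigee_km"])
        if abs(rec["period_minutes"] - expected) / expected > rel_tol:
            flagged.append(rec["name"])
    return flagged
```

A fixed threshold like this only catches gross inconsistencies; the abstract describes a probabilistic treatment that composes a causal program with nonparametric Bayesian modelling, which this sketch does not attempt.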
Multilingual Bidirectional Unsupervised Translation Through Multilingual Finetuning and Back-Translation
Li, Bryan, Rasooli, Mohammad Sadegh, Patel, Ajay, Callison-Burch, Chris
We propose a two-stage approach for training a single NMT model to translate unseen languages both to and from English. For the first stage, we initialize an encoder-decoder model with pretrained XLM-R and RoBERTa weights, then perform multilingual fine-tuning on parallel data in 40 languages to English. We find this model can generalize to zero-shot translations on unseen languages. For the second stage, we leverage this generalization ability to generate synthetic parallel data from monolingual datasets, then bidirectionally train with successive rounds of back-translation. Our approach, which we call EcXTra (English-centric Crosslingual (X) Transfer), is conceptually simple, only using a standard cross-entropy objective throughout. It is also data-driven, sequentially leveraging auxiliary parallel data and monolingual data. We evaluate unsupervised NMT results for 7 low-resource languages, and find that each round of back-translation training further refines bidirectional performance. Our final single EcXTra-trained model achieves competitive translation performance in all translation directions, notably establishing a new state-of-the-art for English-to-Kazakh (22.9 > 10.4 BLEU). Our code is available at https://github.com/manestay/EcXTra.
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
- Europe > Italy > Tuscany > Florence (0.04)
- North America > United States > California > Santa Clara County > Mountain View (0.04)
- (9 more...)
NASA condemns Russian cosmonauts for displaying anti-Ukraine propaganda on ISS
NASA has issued a fierce condemnation of the Russian space agency after three cosmonauts displayed anti-Ukraine propaganda aboard the International Space Station. The trio were seen holding flags of the Luhansk People's Republic and the Donetsk People's Republic -- two Russian-backed separatist regions in eastern Ukraine that are only recognised as independent states by Moscow and Syria. They also said the capture of the region was 'a liberation day to celebrate both on Earth and in space.' In response to the pictures, posted by Russia's state space corporation Roscosmos, NASA said it 'strongly rebukes Russia using the International Space Station for political purposes to support its war against Ukraine.' Press secretary Jackie McGuinness added that it was 'fundamentally inconsistent with the station's primary function among the 15 international participating countries to advance science and develop technology for peaceful purposes.'
- North America > United States (1.00)
- Asia > Russia (0.82)
- Europe > Ukraine > Luhansk Oblast > Luhansk (0.28)
- (10 more...)
- Government > Space Agency (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
Russia's 'walking arm' robot successfully docks with International Space Station after eight days in space
Russia's long-delayed lab module successfully docked with the International Space Station on Thursday, eight days after it was launched from the Russian space launch facility in Baikonur, Kazakhstan. The 20-metric-ton (22-ton) Nauka module, also called the Multipurpose Laboratory Module, docked with the orbiting outpost in an automatic mode after a long journey and a series of manoeuvres. Russia's space agency, Roscosmos, confirmed the module's contact with the International Space Station at 13:29 GMT. It carried with it the European Robotic Arm (ERA) payload, which can handle components up to 8000 kilograms and transport astronauts. The launch of Nauka, which is intended to provide more room for scientific experiments and space for the crew, had been repeatedly delayed because of technical problems.
- Europe > Russia (0.88)
- Asia > Russia (0.88)
- Asia > Kazakhstan > Kyzylorda Region > Karmakshy District > Baikonur (0.26)
- Asia > Japan (0.06)
Russia launches new 'walking' robot arm module to the International Space Station
A Proton rocket launched from the Baikonur Cosmodrome in Kazakhstan today, taking the European Robotic Arm (ERA) payload to the International Space Station. The 11-meter-long robot has been folded and attached to the Multipurpose Laboratory Module, also called 'Nauka', that will be its home base when it reaches the ISS. The rocket put Nauka and the ERA into orbit at 16:08 GMT, ten minutes after liftoff, at an altitude of nearly 200 kilometres above the Earth. The ISS already has two robotic arms, which are used to berth spacecraft and transfer payloads and astronauts, but neither arm can reach the Russian segment, the European Space Agency said. Instead, the ERA will 'walk' around the Russian parts of the orbital complex, handling components up to 8000 kilograms, and transport astronauts when it eventually reaches the station.
- Europe > Russia (0.40)
- Asia > Russia (0.40)
- Asia > Kazakhstan > Kyzylorda Region > Karmakshy District > Baikonur (0.27)
Russia is launching a new module for the International Space Station
Russia is launching a new module for the International Space Station (ISS), after more than a decade of delays. The Nauka module is set to lift off from Baikonur Cosmodrome in Kazakhstan on top of a Proton-M rocket at around 1500 GMT today, along with a new robotic arm for the station created by the European Space Agency. The ISS is composed of modules and equipment from different space agencies including Europe, Japan and Canada, but the bulk of the station is composed of two main sections, a Russian segment and a US segment. At 13 metres long and weighing more than 20 tonnes, Nauka, also called the Multipurpose Laboratory Module, will be among the largest in Russia's half. After launch, Nauka will take eight days to reach the ISS.
- Europe > Russia (0.93)
- Asia > Russia (0.93)
- Asia > Kazakhstan > Kyzylorda Region > Karmakshy District > Baikonur (0.26)
- Government > Space Agency (1.00)
- Government > Regional Government > North America Government > United States Government (0.34)
Russian cargo ship will narrowly avoid a collision with a SpaceX Starlink satellite tonight
A Russian cargo ship on its way to the International Space Station (ISS) will come perilously close to colliding with one of SpaceX's satellites, according to the country's space agency Roscosmos. The Progress 78 spacecraft, which blasted off from the Baikonur cosmodrome in Kazakhstan on Tuesday, will also narrowly miss a Falcon 9 rocket fragment left in orbit from 2020. Preliminary calculations suggest the Starlink 1691 satellite will come within 0.9 miles (1.5 km) of hitting Progress at 17:32 ET (22:32 BST) tonight, while the booster is expected to miss by 0.3 miles (500 m) three minutes later. Starlink 1691 was launched in September last year but is understood to have been lowered out of operational orbit at 340 miles in April. The close approach will take place just three-and-a-half hours before the spacecraft is set to dock with the ISS at 21:02 ET (02:02 BST) on July 2. Roscosmos said: 'Preliminary data show the Starlink 1691 satellite approach the Progress MS-17 spacecraft at 21:32 UTC at a distance of about 1.5 km.
- Government > Space Agency (1.00)
- Aerospace & Defense (1.00)
- Transportation > Marine (0.82)
- Transportation > Freight & Logistics Services > Shipping (0.82)